Fix concurrency/scaling when many Python threads do streaming using sync completions #14816

leventov · 2025-09-23T15:06:02Z

Problem

I have 14 Python threads doing streaming LLM requests concurrently.

Expectation: this is a mostly network-I/O-bound workload, so it should scale reasonably well with the number of threads.

Reality: performance is practically serial, i.e., the whole run takes only slightly less than L * N / nCPU, where L is the latency of one such request (streaming or not), N is the total number of requests, and nCPU is the number of (v)CPUs available to the Python process.
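To make the gap concrete, here is a small back-of-the-envelope sketch. The function names and all the numbers are hypothetical, and the I/O-bound estimate assumes each thread waits on the network with negligible CPU work, which is the expectation stated above rather than a measured result.

```python
import math


def serial_like_estimate(latency_s: float, num_requests: int, n_cpu: int) -> float:
    """Wall time if requests effectively serialize per CPU: L * N / nCPU."""
    return latency_s * num_requests / n_cpu


def io_bound_estimate(latency_s: float, num_requests: int, num_threads: int) -> float:
    """Idealized wall time if threads only wait on the network (an assumption):
    requests complete in waves of `num_threads`, each wave taking one latency."""
    return latency_s * math.ceil(num_requests / num_threads)


# Hypothetical numbers: 10 s per request, 140 requests, 2 vCPUs, 14 threads.
observed_like = serial_like_estimate(10.0, 140, 2)
expected_like = io_bound_estimate(10.0, 140, 14)
print(f"serial-like: {observed_like:.0f} s, ideal I/O-bound: {expected_like:.0f} s")
```

With these made-up inputs the serial-like figure is several times larger than the idealized one, which is the kind of gap the problem statement describes.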

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

I have added testing in the tests/litellm/ directory (adding at least 1 test is a hard requirement - see details)
I have added a screenshot of my new test passing locally
My PR passes all unit tests on make test-unit
My PR's scope is as isolated as possible, it only solves 1 specific problem

Type

🆕 New Feature
🐛 Bug Fix

Changes

Guard executor.submit() with `if not litellm.disable_streaming_logging` in the hot path, __next__() in streaming_handler.py. This is a no-brainer change: run_success_logging_and_cache_storage() is a no-op when litellm.disable_streaming_logging is True, so submitting it to an executor only adds overhead.
Update the httpcore dependency to pick up "Don't hold lock unless necessary in PoolByteStream.close()" (encode/httpcore#1038).
Make the sync transport configurable via litellm.sync_transport. I could have avoided this by changing the client altogether, but this is a quality-of-life change.
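The first change can be illustrated with a minimal, self-contained sketch of the guard pattern. This is not LiteLLM's actual code: GuardedStream, CountingExecutor, and consume are hypothetical stand-ins, and the flag is passed as a constructor argument here rather than read from litellm.disable_streaming_logging.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Iterator, List, Tuple


class CountingExecutor(ThreadPoolExecutor):
    """ThreadPoolExecutor that counts submit() calls (illustration only)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.submitted = 0

    def submit(self, fn, /, *args, **kwargs):
        self.submitted += 1
        return super().submit(fn, *args, **kwargs)


class GuardedStream:
    """Hypothetical stream wrapper that skips the executor when logging is off."""

    def __init__(self, chunks: Iterable[str], executor: ThreadPoolExecutor,
                 disable_streaming_logging: bool) -> None:
        self._chunks: Iterator[str] = iter(chunks)
        self._executor = executor
        self._disable = disable_streaming_logging
        self.logged: List[str] = []

    def __iter__(self) -> "GuardedStream":
        return self

    def __next__(self) -> str:
        chunk = next(self._chunks)  # StopIteration propagates at end of stream
        # The guard: never hand a would-be no-op to the executor on the hot path.
        if not self._disable:
            self._executor.submit(self.logged.append, chunk)
        return chunk


def consume(chunks: List[str], disable_streaming_logging: bool) -> Tuple[List[str], int]:
    """Drain a guarded stream; return the chunks seen and the submit() count."""
    with CountingExecutor(max_workers=1) as executor:
        received = list(GuardedStream(chunks, executor, disable_streaming_logging))
    return received, executor.submitted
```

The consumer sees the same chunks either way; the only difference is whether each chunk also costs an executor.submit() call, which is what the guard avoids when logging is disabled.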

Then, in my code that uses litellm, I pre-configure the HTTPTransport as follows:

# This block of code reproduces the default args used in
# litellm.llms.custom_httpx.http_handler.HTTPHandler.__init__() + initial_connections.
import os

import litellm
from httpcore import Origin
from httpx import HTTPTransport, Limits
from litellm.llms.custom_httpx.http_handler import get_ssl_configuration

ssl_config = get_ssl_configuration(ssl_verify=None)
cert = os.getenv("SSL_CERTIFICATE", litellm.ssl_certificate)
limits = Limits(max_connections=1000, max_keepalive_connections=1000)

# Number of connections to pre-create per origin.
initial_connections = {
    Origin(b"https", b"generativelanguage.googleapis.com", 443): 14,
    Origin(b"https", b"openrouter.ai", 443): 14,
}
litellm.sync_transport = HTTPTransport(verify=ssl_config, cert=cert, limits=limits)
# Note: _pool and _connections are private httpx/httpcore attributes.
pool = litellm.sync_transport._pool
for origin, num_conn in initial_connections.items():
    for _ in range(num_conn):
        pool._connections.append(pool.create_connection(origin=origin))
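One way to check whether a configuration like this actually improves scaling is to time N threads that each drain one stream. The harness below is a sketch under stated assumptions: fake_stream is a hypothetical stand-in that sleeps instead of calling an LLM, and time_concurrent_streams and drain are illustrative helpers, not part of litellm.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List, Tuple


def fake_stream(num_chunks: int = 4, delay_s: float = 0.05) -> Iterator[str]:
    """Hypothetical stand-in for a streaming response; sleeps mimic network waits."""
    for i in range(num_chunks):
        time.sleep(delay_s)
        yield f"chunk-{i}"


def drain(make_stream: Callable[[], Iterable[str]]) -> int:
    """Consume one stream to completion and return the number of chunks seen."""
    return sum(1 for _ in make_stream())


def time_concurrent_streams(
    make_stream: Callable[[], Iterable[str]], num_threads: int
) -> Tuple[float, List[int]]:
    """Run `num_threads` streams concurrently; return (wall seconds, chunk counts)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        counts = list(pool.map(lambda _: drain(make_stream), range(num_threads)))
    return time.perf_counter() - start, counts


elapsed, counts = time_concurrent_streams(fake_stream, num_threads=14)
print(f"{len(counts)} streams finished in {elapsed:.2f} s")
```

To measure a real workload, fake_stream would be swapped for a callable that returns the iterator from a sync streaming completion (for example litellm.completion(..., stream=True)). Wall time close to a single request's latency suggests good scaling, while wall time close to the sum of all latencies suggests serialization.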

CLAassistant · 2025-09-23T15:06:10Z

All committers have signed the CLA.

vercel · 2025-09-23T16:02:14Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
litellm	Ready	Preview	Comment	Sep 23, 2025 4:07pm

leventov added 2 commits September 23, 2025 13:50

Don't submit a task to a thread if streaming logging is disabled

6a9ebe8

Make httpx's sync transport configurable

0a5fe10

vercel bot deployed to Preview September 23, 2025 16:07 View deployment

krrishdholakia merged commit 2e3e7de into BerriAI:main Sep 24, 2025
4 of 7 checks passed

leventov mentioned this pull request Sep 24, 2025

Profiling concurrency of streaming sync completions and import warm-up question #14852

Open
